Search CORE

170 research outputs found

Induction of models under uncertainty

Author: Cheeseman Peter
Publication venue
Publication date
Field of study

This paper outlines a procedure for performing induction under uncertainty. This procedure uses a probabilistic representation and uses Bayes' theorem to decide between alternative hypotheses (theories). This procedure is illustrated by a robot with no prior world experience performing induction on data it has gathered about the world. The particular inductive problem is the formation of class descriptions both for the tutored and untutored cases. The resulting class definitions are inherently probabilistic and so do not have any sharply defined membership criterion. This robot example raises some fundamental problems about induction; particularly, it is shown that inductively formed theories are not the best way to make predictions. Another difficulty is the need to provide prior probabilities for the set of possible theories. The main criterion for such priors is a pragmatic one aimed at keeping the theory structure as simple as possible, while still reflecting any structure discovered in the data

NASA Technical Reports Server

Evolutionary tree reconstruction

Author: Cheeseman Peter
Kanefsky Bob
Publication venue
Publication date
Field of study

It is described how Minimum Description Length (MDL) can be applied to the problem of DNA and protein evolutionary tree reconstruction. If there is a set of mutations that transform a common ancestor into a set of the known sequences, and this description is shorter than the information to encode the known sequences directly, then strong evidence for an evolutionary relationship has been found. A heuristic algorithm is described that searches for the simplest tree (smallest MDL) that finds close to optimal trees on the test data. Various ways of extending the MDL theory to more complex evolutionary relationships are discussed

NASA Technical Reports Server

Bayesian classification theory

Author: Cheeseman Peter
Hanson Robin
Stutz John
Publication venue
Publication date
Field of study

The task of inferring a set of classes and class descriptions most likely to explain a given data set can be placed on a firm theoretical foundation using Bayesian statistics. Within this framework and using various mathematical and algorithmic approximations, the AutoClass system searches for the most probable classifications, automatically choosing the number of classes and complexity of class descriptions. A simpler version of AutoClass has been applied to many large real data sets, has discovered new independently-verified phenomena, and has been released as a robust software package. Recent extensions allow attributes to be selectively correlated within particular classes, and allow classes to inherit or share model parameters though a class hierarchy. We summarize the mathematical foundations of AutoClass

NASA Technical Reports Server

Autoclass: An automatic classification system

Author: Cheeseman Peter
Hanson Robin
Stutz John
Publication venue
Publication date
Field of study

The task of inferring a set of classes and class descriptions most likely to explain a given data set can be placed on a firm theoretical foundation using Bayesian statistics. Within this framework, and using various mathematical and algorithmic approximations, the AutoClass System searches for the most probable classifications, automatically choosing the number of classes and complexity of class descriptions. A simpler version of AutoClass has been applied to many large real data sets, has discovered new independently-verified phenomena, and has been released as a robust software package. Recent extensions allow attributes to be selectively correlated within particular classes, and allow classes to inherit, or share, model parameters through a class hierarchy. The mathematical foundations of AutoClass are summarized

NASA Technical Reports Server

Automatic discovery of optimal classes

Author: Cheeseman Peter
Freeman Don
Self Matthew
Stutz John
Publication venue
Publication date
Field of study

A criterion, based on Bayes' theorem, is described that defines the optimal set of classes (a classification) for a given set of examples. This criterion is transformed into an equivalent minimum message length criterion with an intuitive information interpretation. This criterion does not require that the number of classes be specified in advance, this is determined by the data. The minimum message length criterion includes the message length required to describe the classes, so there is a built in bias against adding new classes unless they lead to a reduction in the message length required to describe the data. Unfortunately, the search space of possible classifications is too large to search exhaustively, so heuristic search methods, such as simulated annealing, are applied. Tutored learning and probabilistic prediction in particular cases are an important indirect result of optimal class discovery. Extensions to the basic class induction program include the ability to combine category and real value data, hierarchical classes, independent classifications and deciding for each class which attributes are relevant

NASA Technical Reports Server

Generalized Maximum Entropy

Author: Cheeseman Peter
Stutz John
Publication venue
Publication date: 03/08/2005
Field of study

A long standing mystery in using Maximum Entropy (MaxEnt) is how to deal with constraints whose values are uncertain. This situation arises when constraint values are estimated from data, because of finite sample sizes. One approach to this problem, advocated by E.T. Jaynes [1], is to ignore this uncertainty, and treat the empirically observed values as exact. We refer to this as the classic MaxEnt approach. Classic MaxEnt gives point probabilities (subject to the given constraints), rather than probability densities. We develop an alternative approach that assumes that the uncertain constraint values are represented by a probability density {e.g: a Gaussian), and this uncertainty yields a MaxEnt posterior probability density. That is, the classic MaxEnt point probabilities are regarded as a multidimensional function of the given constraint values, and uncertainty on these values is transmitted through the MaxEnt function to give uncertainty over the MaXEnt probabilities. We illustrate this approach by explicitly calculating the generalized MaxEnt density for a simple but common case, then show how this can be extended numerically to the general case. This paper expands the generalized MaxEnt concept introduced in a previous paper [3]

NASA Technical Reports Server

Automatic classification of spectra from the Infrared Astronomical Satellite (IRAS)

Author: Cheeseman Peter
Goebel John
Self Matthew
Stutz John
Taylor William
Volk Kevin
Walker Helen
Publication venue
Publication date
Field of study

A new classification of Infrared spectra collected by the Infrared Astronomical Satellite (IRAS) is presented. The spectral classes were discovered automatically by a program called Auto Class 2. This program is a method for discovering (inducing) classes from a data base, utilizing a Bayesian probability approach. These classes can be used to give insight into the patterns that occur in the particular domain, in this case, infrared astronomical spectroscopy. The classified spectra are the entire Low Resolution Spectra (LRS) Atlas of 5,425 sources. There are seventy-seven classes in this classification and these in turn were meta-classified to produce nine meta-classes. The classification is presented as spectral plots, IRAS color-color plots, galactic distribution plots and class commentaries. Cross-reference tables, listing the sources by IRAS name and by Auto Class class, are also given. These classes show some of the well known classes, such as the black-body class, and silicate emission classes, but many other classes were unsuspected, while others show important subtle differences within the well known classes

NASA Technical Reports Server